Variational joint self‐attention for image captioning
Authors
Abstract
The image captioning task has attracted great attention from many researchers, and significant progress has been made in the past few years. Existing models, which mainly apply an attention-based encoder-decoder architecture, have achieved notable improvements in captioning. These models, however, are limited in caption generation due to potential errors resulting from inaccurate detection of objects and incorrect relationships between objects. To alleviate this limitation, a Variational Joint Self-Attention model (VJSA) is proposed to learn the latent semantic alignment between a given image and its label description for guiding better caption generation. Unlike existing models, VJSA first uses a self-attention module to encode effective relationship information covering both intra-sequence and inter-sequence relationships. A variational neural inference module then learns the distribution of the latent semantic alignment between an image and its corresponding description. In decoding, the learned latent alignment guides the decoder to generate higher-quality captions. The experimental results reveal that VJSA outperforms the compared models, and its performance on various metrics shows that the model is feasible and effective for caption generation.
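To make the described architecture concrete, below is a minimal, hypothetical PyTorch sketch of the three ingredients the abstract names: a self-attention encoder over visual features, a variational module that infers a Gaussian latent alignment via the reparameterization trick, and a decoding step conditioned on that latent. All module names, dimensions, and the mean-pooling choice are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class VariationalJointAttentionSketch(nn.Module):
    """Illustrative sketch (not the authors' code): self-attention over
    image region features, a variational head that infers a latent
    semantic-alignment vector, and a decoder step conditioned on it."""

    def __init__(self, feat_dim=512, latent_dim=128, vocab_size=10000):
        super().__init__()
        # Self-attention over region features (intra-sequence relations).
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads=8,
                                               batch_first=True)
        # Variational inference head: mean and log-variance of a
        # Gaussian over the latent semantic alignment.
        self.to_mu = nn.Linear(feat_dim, latent_dim)
        self.to_logvar = nn.Linear(feat_dim, latent_dim)
        # One decoding step: pooled visual context + latent -> word logits.
        self.decoder = nn.Linear(feat_dim + latent_dim, vocab_size)

    def forward(self, regions):
        # regions: (batch, num_regions, feat_dim) visual features.
        attended, _ = self.self_attn(regions, regions, regions)
        context = attended.mean(dim=1)  # pooled visual context
        mu, logvar = self.to_mu(context), self.to_logvar(context)
        # Reparameterization trick: sample z ~ N(mu, sigma^2).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        logits = self.decoder(torch.cat([context, z], dim=-1))
        # KL term that regularizes the latent toward N(0, I),
        # as in a standard variational objective.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return logits, kl.mean()
```

In a full captioner the decoding step would run recurrently over the caption tokens; the sketch collapses it to a single step to isolate how the sampled latent guides word prediction.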
Similar resources
Joint Learning of CNN and LSTM for Image Captioning
In this paper, we describe the details of our methods for the participation in the subtask of the ImageCLEF 2016 Scalable Image Annotation task: Natural Language Caption Generation. The model we used combines an encoding procedure and a decoding procedure, built from a convolutional neural network (CNN) and a long short-term memory (LSTM)-based recurrent neural network (a minimal sketch of this encoder-decoder pattern appears after this list). We f...
Contrastive Learning for Image Captioning
Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learn...
Stack-Captioning: Coarse-to-Fine Learning for Image Captioning
The existing image captioning approaches typically train a one-stage sentence decoder, which makes it difficult to generate rich, fine-grained descriptions. On the other hand, multi-stage image caption models are hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which...
Phrase-based Image Captioning
Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representat...
Domain-Specific Image Captioning
We present a data-driven framework for image caption generation which incorporates visual and textual features with varying degrees of spatial structure. We propose the task of domain-specific image captioning, where many relevant visual details cannot be captured by off-the-shelf general-domain entity detectors. We extract previously-written descriptions from a database and adapt them to new q...
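For comparison with the encoder-decoder entries above, such as the joint CNN and LSTM model, here is a minimal, hypothetical PyTorch sketch of that classic pipeline: a CNN encodes the image into a feature vector, which seeds an LSTM decoder that predicts the caption word by word. The backbone choice, sizes, and all names are illustrative assumptions, not code from any of the listed papers.

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnLstmCaptionerSketch(nn.Module):
    """Minimal encoder-decoder captioner in the CNN+LSTM style;
    every dimension and design choice here is illustrative."""

    def __init__(self, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Encoder: CNN backbone with its classifier replaced by a
        # projection into the word-embedding space.
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        # Decoder: word embedding + LSTM + next-word prediction head.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Prepend the image feature as the first "token" of the sequence,
        # so the LSTM conditions all words on the visual content.
        img_feat = self.encoder(images).unsqueeze(1)   # (B, 1, E)
        tokens = self.embed(captions)                  # (B, T, E)
        inputs = torch.cat([img_feat, tokens], dim=1)  # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.head(hidden)                       # next-word logits
```

This one-directional pipeline is exactly what VJSA augments: where the sketch feeds the decoder only a deterministic image feature, VJSA additionally samples a latent alignment vector to guide generation.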
Journal
Journal title: IET Image Processing
Year: 2022
ISSN: 1751-9659, 1751-9667
DOI: https://doi.org/10.1049/ipr2.12470